Abstract


In  this  paper  we  describe  restartable  atomic  sequences,  an  optimistic 
mechanism  for  implementing  simple  atomic  operations  (such  as 
Test-and-Set)  on  a  uniprocessor.  A  thread  that  is  suspended  within  a 
restartable  atomic  sequence  is  resumed  by  the  operating  system  at  the 
beginning  of  the  sequence,  rather  than  at  the  point  of  suspension.  This 
guarantees  that  the  thread  eventually  executes  the  sequence  atomically.  A 
restartable  atomic  sequence  has  significantly  less  overhead  than  other 
software-based  synchronization  mechanisms,  such  as  kernel  emulation  or 
software  reservation.  Consequently,  it  is  an  attractive  alternative  for  use 
on  uniprocessors  that  do  not  support  atomic  operations.  Even  on 
processors  that  do  support  atomic  operations  in  hardware,  restartable 
atomic  sequences  can  have  lower  overhead. 

We  describe  different  implementations  of  restartable  atomic  sequences 
for  the  Mach  3.0  and  Taos  operating  systems.  These  systems’  thread 
management  packages  rely  on  atomic  operations  to  implement  higher- 
level  mutual  exclusion  facilities.  We  show  that  improving  the 
performance  of  low-level  atomic  operations,  and  therefore  mutual 
exclusion  mechanisms,  improves  application  performance. 

1  Introduction 

In  this  paper  we  describe  restartable  atomic  sequences, 
an  optimistic  mechanism  for  implementing  atomic  op¬ 
erations  on  a  uniprocessor.  Our  approach  assumes  that 
short,  atomic  sequences  are  rarely  interrupted.  If  a 
thread  is  interrupted  during  an  atomic  sequence,  we 
rely  on  a  recovery  mechanism  provided  by  the  ker¬ 
nel  that  resumes  the  thread  at  the  beginning  of  the 
sequence.  We  have  implemented  restartable  atomic 
sequences  in  the  Mach  3.0  [Accetta  et  al.  86]  and 
Taos  [Thacker  et  al.  88]  operating  systems,  using  a  dif¬ 
ferent  method  in  each.  We  show  that  restartable  atomic 
sequences  are  significantly  more  efficient  than  other 
software  techniques.  We  have  measured  performance 
improvements  of  up  to  50%  for  some  applications  on  the 
MIPS  R3000-based  [Kane  87]  DECstation  5000/200, 
which  does  not  have  hardware  support  for  atomic  op¬ 
erations.  In  addition,  we  show  that  restartable  atomic 
sequences  outperform  hardware  mechanisms  on  proces¬ 
sors  that  do  provide  explicit  support  for  atomic  opera¬ 
tions. 

1.1  Motivation 

Multithreaded programs use mutual exclusion to guarantee
consistency of shared data structures. Mutual exclusion
mechanisms such as P, V [Dijkstra 68a] and acquire_mutex,
release_mutex [Birrell 91] are implemented using lower-level
operations such as Test-And-Set that grant one of several
threads mutually exclusive access to some data structure.
Even on a uniprocessor, mutual exclusion is necessary to
protect shared data against an interleaved thread schedule.
Interleaving can occur when a thread is suspended (due to a
synchronous fault or an asynchronous preemption), or when a
thread blocks (due to the thread voluntarily relinquishing
the processor).

Efficient mutual exclusion mechanisms are becoming
increasingly important on uniprocessors for two reasons.
First, modern applications use multiple threads as a program
structuring device, as a mechanism for portability to
multiprocessors, and as a way to manage I/O and server
concurrency even when no true CPU parallelism is available.
Second, many operating systems today are built on top of a
microkernel that supports relatively few services; for
example, thread scheduling, virtual memory and interprocess
communication [Mullender et al. 90, Cheriton 88, Rozier et al.
88, Accetta et al. 86, Thacker et al. 88]. Other services
such as the file system and networking are implemented as
multithreaded user-level applications. The microkernel
approach exposes the performance of a system’s mutual
exclusion primitives; even single-threaded programs rely on
basic operating system services that are implemented outside
the kernel using multiple threads. The performance of all
applications is therefore ultimately influenced by the
performance of the underlying mutual exclusion mechanisms.

The mechanisms that have been used to implement atomic
operations on a uniprocessor (i.e., those described in every
undergraduate operating systems textbook) can be
characterized as pessimistic. That is, their design assumes
that atomicity may be violated at any moment (e.g., with an
interrupt), and therefore guards against this potential
violation every time the atomic operation is executed. This
approach, though, can incur a high overhead that affects the
performance of applications relying on mutual exclusion,
either directly or indirectly.

In  contrast,  the  optimistic  mechanism  described  in 
this  paper  assumes  that  atomic  sequences  are  rarely 
interrupted,  and  adopts  an  inexpensive  solution  for  this 
assumed  common  case.  We  show  that  this  assumption 
is  both  accurate,  and  effective  at  reducing  the  overhead 
of  mutual  exclusion. 

1.2  The  rest  of  this  paper 

In the next section we describe restartable atomic sequences
after reviewing several pessimistic techniques for ensuring
mutual exclusion on a uniprocessor. In Section 3 we discuss
implementations of restartable atomic sequences for the Mach
and Taos operating systems. In Section 4 we discuss some of
the kernel design issues that arise when implementing
restartable atomic sequences. In Section 5 we show the
performance impact of using restartable atomic sequences in
the Mach operating system. In Section 6 we show that
restartable atomic sequences have less overhead than
equivalent hardware mechanisms on several processor
architectures. In Section 7 we discuss related work. In
Section 8 we present our conclusions.

2 Implementing mutual exclusion on a uniprocessor

This section describes four techniques for implementing
atomic primitives suitable for use by mutual exclusion
mechanisms on a uniprocessor. We concentrate on the specific
atomic primitive Test-And-Set, although other primitives,
such as Fetch-And-Add, Load-Linked/Store-Conditional, and
Memory-Register-Exchange could be similarly constructed.
Each of these primitives performs an atomic
read-modify-write of a single memory location. Three of the
techniques, memory-interlocked instructions, software
reservation and kernel emulation, are pessimistic. The
fourth, restartable atomic sequences, is based on the
optimistic approach.

2.1  Memory-interlocked  instructions 

Memory-interlocked  instructions  (or  instruction  se¬ 
quences)  require  special  hardware  support  from  the 
processor  and  bus  to  ensure  that  a  given  memory  loca¬ 
tion  can  be  read,  modified  and  written  without  inter¬ 
ruption.  Memory-interlocked  instructions  are  primarily 
intended  to  support  multiprocessing,  but  can  be  used 
on  uniprocessor  systems  as  well.  Unfortunately,  not 
all  processors  support  memory-interlocked  instructions, 
and  many  that  do,  do  so  reluctantly;  i.e.,  the  cycle  time 
for  an  interlocked  access  is  several  times  greater  than 
that  for  a  non-interlocked  access.  The  reasons  for  the 
higher  cost  are  increased  complexity  [Intel860  89],  an 
overly  rich  set  of  atomic  operations  [Leonard  87,  In- 
tel386  90],  support  for  atomic  updates  on  arbitrary  bit 
boundaries  [Leonard  87],  and  the  fact  that  atomic  op¬ 
erations  may  bypass  the  on-chip  cache  [Motorola  88100 
88]. A good survey of memory-interlocked instructions
and their implementations can be found in [Glew &
Hwu 91].

2.2  Software  reservation 

Atomic operations can also be constructed using software
reservation algorithms, such as Dekker’s [Dijkstra 68b],
Peterson’s [Peterson 81] or Lamport’s [Lamport 87]. Roughly
speaking, with software reservation algorithms, a thread
must register its intent to perform an atomic operation and
then wait until no other thread has registered a similar
intent before proceeding. We use Lamport’s fast mutual
exclusion algorithm to evaluate software reservation schemes
since it has been proven correct and shown to be [...]. If
one is willing to put an upper bound on the duration of the
critical section, then it is possible to implement
multiprocessor mutual exclusion with fewer instructions than
required by Lamport’s algorithm. Such a limitation, though,
is generally not feasible on a multiprocessor, and would be
nearly impossible on a uniprocessor.

In Lamport’s algorithm, shown in Figure 1, each thread has a
unique identifier i which is used to place reservations into
the variable x, and to indicate ownership of the lock via
the variable y. In the normal case (no contention, no
collision), Lamport’s algorithm requires two loads and five
stores, executing in order the lines [1,2,3,9,10,19,21,22].
If a thread reaches line 3, though, and finds that the lock
is held by another thread, there is contention, and the
thread must wait until the lock is released. The array b is
used to resolve collisions, which occur whenever two or more
threads find that the lock is free at line 3 and proceed to
line 9 simultaneously (or through an interleaved schedule on
a uniprocessor). A collision by n threads will be detected
at line 10 by n - 1 of them; those n - 1 will enter the loop
at line 12 and wait until the collisions have settled out
(lines 12 through 15). The await used at lines 5, 12 and 14
is necessary when there is contention or collision, and can
be implemented on a uniprocessor by having the awaiting
thread yield its processor to the scheduler.

start:
 1  b[i] := true;
 2  x := i;
 3  if y <> 0 then  { Contention }
 4      b[i] := false;
 5      await (y = 0);
 6      goto start;
 7  end;
 8
 9  y := i;
10  if x <> i then  { Collision }
11      b[i] := false;
12      for j := 1 to N await (b[j] = false);
13      if y <> i then
14          await (y = 0);
15          goto start;
16      end;
17  end;
18
19  CRITICAL SECTION
20
21  y := 0;
22  b[i] := false;

Figure 1: Lamport’s fast mutual exclusion algorithm.
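
As an illustration only, the algorithm of Figure 1 can be transcribed into Python and exercised with a few threads. This is a sketch under stated assumptions, not the code measured in this paper: the names `await_`, `acquire`, `release`, `worker` and `run` are ours, the await is approximated by yielding with `time.sleep(0)`, and the sketch relies on the interpreter executing the individual reads and writes of x, y and b[i] in program order.

```python
import threading
import time

N = 3                  # number of threads; thread ids run from 1 to N
b = [False] * (N + 1)  # b[i] is True while thread i is trying to enter
x = 0                  # id of the most recent thread to place a reservation
y = 0                  # id of the lock owner, or 0 when the lock is free
counter = 0            # shared data protected by the lock


def await_(cond):
    """The figure's 'await': yield the processor until cond() holds."""
    while not cond():
        time.sleep(0)


def acquire(i):
    global x, y
    while True:                           # label 'start'
        b[i] = True                       # line 1
        x = i                             # line 2
        if y != 0:                        # line 3: contention
            b[i] = False
            await_(lambda: y == 0)
            continue                      # goto start
        y = i                             # line 9
        if x != i:                        # line 10: collision
            b[i] = False
            for j in range(1, N + 1):     # line 12
                await_(lambda j=j: not b[j])
            if y != i:                    # line 13
                await_(lambda: y == 0)
                continue                  # goto start
        return                            # lock held: enter critical section


def release(i):
    global y
    y = 0                                 # line 21
    b[i] = False                          # line 22


def worker(i, rounds):
    global counter
    for _ in range(rounds):
        acquire(i)
        tmp = counter                     # non-atomic read-modify-write
        time.sleep(0)                     # invite a thread switch mid-update
        counter = tmp + 1
        release(i)


def run(rounds=100):
    """Run N workers and return the final value of the shared counter."""
    global counter
    counter = 0
    threads = [threading.Thread(target=worker, args=(i, rounds))
               for i in range(1, N + 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

If mutual exclusion holds, `run(rounds)` returns `N * rounds`, because no increment of the counter is lost despite the thread switch invited inside the critical section.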

Although reservation-based algorithms such as Lamport’s are
correct in principle, they are in practice unwieldy, having
storage requirements that are O(n × l), where n is the
maximum number of threads that may be simultaneously active,
and l is the maximum number of synchronization objects.

The space requirement can be reduced to O(n) with a single
“meta-atomic object” which is used to control access to all
“regular atomic objects.” In this case, the critical section
at line 19 in Figure 1 becomes a code sequence to access the
“regular atomic object.” For example, we can bundle the
reservation algorithm inside a Test-And-Set procedure (see
Figure 2).

function Meta-Atomic-Test-And-Set(var p: integer): integer;
var result: integer;
begin
    [ lines 1-18 from Lamport’s algorithm ]
    if (p = 0) then
        result := 0;
        p := 1;
    else
        result := 1;
    end;
    [ lines 21-22 from Lamport’s algorithm ]
    return result;
end Meta-Atomic-Test-And-Set;

procedure AtomicClear(var p: integer)
begin
    p := 0;
end AtomicClear;

Figure 2: Bundled Test-And-Set using Lamport’s algorithm.

Even  though  bundling  reduces  the  space  requirement 
for  an  atomic  Test-And-Set  variable  to  one  bit  (space 
for  the  meta  variables  x,  y,  and  b  can  be  counted  as 
constant  system  overhead),  it  increases  the  number  of 
memory  accesses  to  enter  and  exit  a  critical  section 
to  at  least  three  loads  and  seven  stores.  Additionally, 
bundling  serializes  all  atomic  operations,  even  those 
for  unrelated  synchronization  objects.  On  a  uniproces¬ 
sor,  for  example,  a  thread  preempted  during  the  func¬ 
tion  Meta-Atomic-Test-And-Set  would  prevent  other 
threads  from  executing  any  atomic  operation. 


2.3  Kernel  emulation 

Memory-interlocked instructions and software reservation
protocols work on both uniprocessors and multiprocessors. A
strictly uniprocessor solution has the kernel export its
ability to perform atomic operations to applications by
means of a system call that does an atomic read-modify-write
on a memory location in the caller’s address space. In the
kernel, processor interrupts are disabled during the
execution of the atomic operation.

Although  kernel  emulation  requires  no  special  hard¬ 
ware,  its  runtime  cost  is  high.  The  kernel  must  be 
invoked  on  every  synchronization  operation,  requiring 
that  a  trap  be  fielded  and  dispatched,  state  saved  and 
restored,  and  arguments  checked.  On  the  MIPS  R3000, 
for  example,  building  a  Test-And-Set  with  kernel  emu¬ 
lation  takes  about  100  instructions. 


function Test-And-Set(var p: integer): integer;
var result: integer;
begin
1   result := 1;
2   BEGIN RESTARTABLE ATOMIC SEQUENCE
3   if p = 1 then
4       result := 0;
5   else
6       p := 1;
7   end;
8   END RESTARTABLE ATOMIC SEQUENCE
9   return result;
end Test-And-Set;

Figure 3: Generic Test-And-Set using a restartable atomic
sequence.


2.4  Restartable  atomic  sequences 


The  three  mechanisms  described  so  far  are  pessimistic. 
A  memory-interlocked  instruction  implicitly  delays  in¬ 
terrupts  until  the  instruction  completes;  a  software 
reservation  algorithm  explicitly  guards  against  arbi¬ 
trary  interleaving;  kernel  emulation  explicitly  disables 
interrupts  during  operations  that  must  execute  atomi¬ 
cally. 

On a uniprocessor, an atomic read-modify-write operation can
be performed optimistically. Instead of using a mechanism
that guards against interrupts, we can instead recognize
when an interrupt occurs and recover. For any
read-modify-write sequence, the recovery process is
straightforward: restart the sequence. In this way, when the
sequence eventually completes, it will have completed
without interruption, i.e., atomically.

An atomic Test-And-Set operation is shown in Figure 3. As
long as statements 3 through 7 execute without interruption
on a uniprocessor, this code will atomically read and write
the variable p. If an interrupt occurs that would allow
another thread to possibly modify the variable p, then the
interrupted thread must resume execution at line 3 when it
is next scheduled. The corresponding clear operation can
store a zero into p as long as single-word memory accesses
execute atomically.

Restartable atomic sequences are attractive because they do
not require hardware support, have a short code path with
one load and one store per atomic read-modify-write (in the
common case of no interruptions), and do not involve the
kernel on every atomic operation. Only when an atomic
instruction sequence might not have executed atomically is
it necessary to perform a recovery action to ensure
atomicity. In the next section we describe two recovery
strategies.
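
The effect of restarting can be made concrete with a small simulation. The sketch below is a toy model, not either of the implementations described in this paper: a Test-And-Set is broken into three hypothetical "instructions" (load the old value, load the constant 1, store it), the names `Thread`, `step`, `suspend` and `scenario` are ours, and the return value is assumed to be the old value of the lock word, so that 0 means the caller acquired the lock.

```python
SEQ_START, SEQ_END = 0, 3   # pcs 0, 1 and 2 form the restartable sequence


class Thread:
    def __init__(self, name):
        self.name = name
        self.pc = 0
        self.v0 = None      # return value register: old value of the lock word
        self.t0 = None      # temporary register


def step(thread, mem):
    """Execute one instruction of Test-And-Set on the lock word mem['p']."""
    if thread.pc == 0:
        thread.v0 = mem["p"]        # load the old value
    elif thread.pc == 1:
        thread.t0 = 1               # load the constant 1
    elif thread.pc == 2:
        mem["p"] = thread.t0        # store 1 into the lock word
    thread.pc += 1


def suspend(thread, restartable):
    """Kernel suspension path: roll the PC back if inside the sequence."""
    if restartable and SEQ_START <= thread.pc < SEQ_END:
        thread.pc = SEQ_START


def run_to_completion(thread, mem):
    while thread.pc < SEQ_END:
        step(thread, mem)


def scenario(restartable):
    """Preempt thread A between its load and its store; let B run meanwhile."""
    mem = {"p": 0}
    a, b = Thread("A"), Thread("B")
    step(a, mem)                    # A loads p == 0 ...
    suspend(a, restartable)         # ... and is preempted before its store
    run_to_completion(b, mem)       # B runs the whole sequence and takes the lock
    run_to_completion(a, mem)       # A is resumed
    return a.v0, b.v0               # 0 means "lock was free, caller owns it"
```

Without the rollback, `scenario(False)` lets both threads observe a free lock. With it, `scenario(True)` makes thread A repeat its load after B has stored, so only B acquires the lock.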


3 Implementing restartable atomic sequences

Restartable  atomic  sequences  require  kernel  support  to 
ensure  that  a  suspended  thread  is  resumed  at  the  be¬ 
ginning  of  the  sequence.  This  section  describes  two 
strategies  for  implementing  that  kernel  support.  The 
first  strategy,  used  by  the  Mach  3.0  kernel,  places 
a  restartable  atomic  sequence  at  a  designated  code 
range  within  a  program.  The  second  strategy,  used 
by  the  Taos  kernel,  constructs  restartable  atomic  se¬ 
quences  out  of  unique  code  fragments  against  which  a 
suspended  thread’s  current  instruction  stream  is  com¬ 
pared.  Both  strategies  have  been  implemented  in  ver¬ 
sions  of  the  operating  systems  running  on  the  MIPS 
R3000-based  DECstation  5000/200. 


3.1  Explicit  registration  in  Mach 

The  Mach  operating  system  implements  a  strategy 
based  on  explicit  registration.  The  kernel  keeps  track 
of  each  address  space’s  restartable  atomic  sequence.  If 
a  thread  is  suspended  within  that  sequence,  it  is  re¬ 
sumed  at  the  beginning.  An  application  registers  the 
starting  address  and  length  of  the  sequence  with  the 
kernel.  The  registration  is  done  automatically  during 
program  initialization  by  C-Threads  [Cooper  &  Draves 
88],  Mach’s  thread  management  package. 

An address space may register only one restartable atomic
sequence at a time. This restriction simplifies the kernel’s
task of determining if a suspended thread was executing
within a restartable sequence. When the thread management
system attempts to register a restartable atomic sequence
with a kernel that does not support such sequences, the
registration fails. In response to the failure, the thread
management system overwrites the restartable atomic sequence
with code that uses a conventional mechanism. Overwriting
ensures binary portability between uniprocessors and
multiprocessors, and binary compatibility with older kernels
that do not support restartable atomic sequences.
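
The registration protocol can be sketched as follows. This is a model of the behavior described above, not the Mach interface: the class `Kernel`, the methods `register_ras` and `resume_pc`, and the function `init_thread_package` are hypothetical names, and overwriting the sequence is represented only by the returned string.

```python
class Kernel:
    """Toy kernel that tracks one restartable sequence per address space."""

    def __init__(self, supports_ras):
        self.supports_ras = supports_ras
        self.ranges = {}                      # address space id -> (start, length)

    def register_ras(self, space, start, length):
        """Hypothetical registration call; returns False if unsupported."""
        if not self.supports_ras:
            return False
        self.ranges[space] = (start, length)  # a new registration replaces the old
        return True

    def resume_pc(self, space, pc):
        """PC at which a thread suspended at `pc` should be resumed."""
        if space in self.ranges:
            start, length = self.ranges[space]
            if start <= pc < start + length:
                return start                  # inside the sequence: restart it
        return pc                             # anywhere else: resume in place


def init_thread_package(kernel, space, start, length):
    """Register the sequence, or fall back to a conventional mechanism."""
    if kernel.register_ras(space, start, length):
        return "restartable"
    return "conventional"   # stands in for overwriting the sequence in place
```

Because each address space has at most one registered range, the check in `resume_pc` is a single range comparison on the saved PC.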

A registered Test-And-Set function can be implemented with a
single four-word (and four-cycle) sequence on a load/store
RISC architecture. For example, the assembly code for this
function on a MIPS R3000 is shown in Figure 4. Line 1 loads
the current value of the Test-And-Set location, passed in
register a0, into the return value register, v0. Line 2
loads a temporary register with the value 1. Line 3 returns
control back to the caller. Line 4, which executes in the
branch delay slot following the return, stores a 1 into the
Test-And-Set location. Lines 1-4 form the restartable atomic
sequence: when the store finally occurs at the end of line
4, no other thread will have executed since the thread’s
most recent load at line 1.

Figure 4: Restartable Test-And-Set procedure using explicit
registration in Mach 3.0.

Costs  of  explicit  registration 

There  are  two  runtime  costs  associated  with  explicit 
registration.  Because  the  kernel  identifies  restartable 
atomic  sequences  by  a  single  PC  range  per  address 
space,  they  cannot  be  inlined.  The  inability  to  inline 
slightly  increases  the  overhead  of  atomic  operations  be¬ 
cause  of  the  cost  of  subroutine  linkage. 

The  second  cost  comes  from  having  to  check  the  re¬ 
turn  PC  whenever  a  thread  is  suspended.  Although 
this  test  adds  a  few  tens  of  cycles  to  the  kernel’s  thread 
suspension  path  (which  is  already  several  hundred  cy¬ 
cles  long),  thread  suspensions  occur  far  less  often  than 
atomic  operations,  making  the  additional  scheduling 
overhead  worthwhile. 
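
The tradeoff can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the function names are ours, the default of 100 is the instruction count quoted earlier for kernel emulation, 4 is the length of the registered sequence, 30 stands in for "a few tens of cycles", and counting one instruction as one cycle is a simplifying assumption.

```python
def net_cycles_saved(atomic_ops, suspensions,
                     baseline_cost=100, ras_cost=4, check_cost=30):
    """Cycles saved by restartable sequences over a baseline mechanism."""
    saved = atomic_ops * (baseline_cost - ras_cost)   # cheaper atomic operations
    added = suspensions * check_cost                  # extra PC check per suspension
    return saved - added


def break_even_ops_per_suspension(baseline_cost=100, ras_cost=4, check_cost=30):
    """Atomic operations per suspension needed to pay for the PC check."""
    return check_cost / (baseline_cost - ras_cost)
```

Under these assumed costs, the PC check pays for itself as soon as a thread performs roughly one atomic operation for every three suspensions.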

3.2  Designated  sequences  in  Taos 

Taos uses designated code sequences to recognize when a
thread has been suspended within an atomic sequence. The
kernel compares the instruction stream of a suspended thread
against a designated sequence. The comparison allows
restartable atomic sequences to occur anywhere in a program,
enabling inlining and eliminating the branch overhead of
explicit registration.

The  kernel’s  comparison  must  recognize  every  inter¬ 
rupted  sequence  and  reject  any  other  similar  looking 
sequence  since  mistakenly  changing  the  PC  in  such  a 
situation  could  cause  code  to  malfunction.  Taos  uses 
a  two-stage  check  to  unambiguously  recognize  atomic 
sequences. 

The  first  stage  is  a  fast  test  which  rejects  most  in¬ 
terrupted  code  sequences  that  are  not  restartable.  The 
opcode  of  the  suspended  instruction  is  used  as  an  in¬ 
dex  into  a  hash  table  containing  instructions  eligible  to 
appear  in  a  restartable  atomic  sequence.  If  the  opcode 
matches  the  contents  of  the  indexed  entry,  the  test  pro¬ 
ceeds  to  the  second  stage.  The  first  check  is  quite  fast, 
yet  succeeds  in  rejecting  a  large  majority  of  the  non- 
atomic  cases  and  none  of  the  atomic  ones.  The  few  that 
pass  this  check,  comprising  all  of  the  suspended  atomic 
sequences,  plus  a  much  larger  number  of  false  alarms, 
move  on  to  the  second  stage  of  the  check. 

The second stage uses another table, again indexed by
opcode, to determine the expected offset from the suspended
instruction to a “landmark” no-op. The landmark no-op is
never emitted by the compiler under normal circumstances,
but is present within every restartable atomic sequence. On
the R3000, the landmark no-op is a non-destructive register
move which fills an otherwise useless branch delay slot. If
the second stage finds the landmark in the expected
position, it recognizes the sequence as atomic and restarts
it. Otherwise, the sequence is rejected as a false alarm.
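
The two-stage check can be sketched as a pair of table lookups. The sketch is a model under stated assumptions rather than the Taos code: instructions are represented as tuples, the landmark encoding, the opcodes and the offsets in `LANDMARK_OFFSET` are invented for illustration, and a single dictionary serves both as the first-stage membership test and as the second-stage offset table.

```python
LANDMARK = ("move", "r0", "r0")   # hypothetical landmark no-op encoding

# Stage 2 table: opcode -> expected offset (in instructions) from the
# suspended instruction to the landmark. Stage 1 tests membership of its keys.
LANDMARK_OFFSET = {"lw": 2, "bne": 1, "move": 0, "lui": -1, "sw": -2}


def in_restartable_sequence(code, pc):
    """Return True if the instruction at `pc` lies in a designated sequence."""
    opcode = code[pc][0]
    if opcode not in LANDMARK_OFFSET:          # stage 1: fast rejection
        return False
    where = pc + LANDMARK_OFFSET[opcode]       # stage 2: look for the landmark
    return 0 <= where < len(code) and code[where] == LANDMARK


PROGRAM = [
    ("addu", "t1", "t2", "t3"),     # 0: ordinary code
    ("lw", "t0", "0(a0)"),          # 1: designated sequence starts here
    ("bne", "t0", "zero", "slow"),  # 2
    ("move", "r0", "r0"),           # 3: landmark in the branch delay slot
    ("lui", "t1", "0x8000"),        # 4
    ("sw", "t1", "0(a0)"),          # 5: designated sequence ends here
    ("lw", "t4", "4(sp)"),          # 6: look-alike load outside any sequence
    ("addu", "t4", "t4", "t1"),     # 7
    ("jr", "ra"),                   # 8
]
```

The load at index 6 passes the first stage because its opcode is eligible, but fails the second because no landmark sits two instructions later. It is one of the false alarms the second stage exists to reject.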

The designated sequence for acquiring a mutex is shown in
Figure 5. The sequence is optimistic in two distinct senses:
it assumes both that it will not be interrupted, and that it
will find the mutex unlocked. Both assumptions model the
frequent case, but either or both can fail independently.
The sequence is essentially a Test-And-Set of an entire
word, where the unlocked value of the mutex is 0, and the
locked-but-no-waiters value is 0x80000000. Typically, the
sequence finds that the mutex has the former value and
atomically sets it to the latter. The infrequent case is
handled with an out-of-line kernel call via SlowAcquire. The
sequence for mutex release (Test-And-Clear) is similar.

Figure 5: A restartable atomic sequence for mutex
acquisition using an inlined designated sequence.
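
The logic of the acquire path, though not its atomicity, can be written out as below. This is only a sketch: the one-element list standing in for the mutex word, the function name `acquire` and the `slow_acquire` callback are assumptions. In Taos the test and the store would be the inlined designated sequence, which the kernel restarts if the thread is suspended within it.

```python
UNLOCKED = 0x00000000
LOCKED_NO_WAITERS = 0x80000000


def acquire(mutex, slow_acquire):
    """Acquire `mutex`, a one-element list standing in for the mutex word."""
    # Fast path: the test and the store below correspond to the designated
    # sequence; here nothing makes them atomic, so this shows the logic only.
    if mutex[0] == UNLOCKED:
        mutex[0] = LOCKED_NO_WAITERS
        return "fast"
    # Infrequent case: hand the mutex to an out-of-line slow path.
    slow_acquire(mutex)
    return "slow"
```

A caller would pass whatever stands in for SlowAcquire as `slow_acquire`. The fast path touches the mutex word with one load and one store and never enters the kernel.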

Costs  of  designated  sequences 

Designated sequences have several costs. There is the
measurable cost of the two-stage check on every thread
switch. The check is currently implemented in Modula-2+, the
language in which the operating system is written [Rovner et
al. 85]. As with Mach’s explicit registration, the check
adds a few tens of instructions to the kernel’s context
switch path (counting instructions in the generated code
shows that the check adds about 2 μsecs on a MIPS R3000 in
the common case).

Unlike  explicit  registration,  which  uses  only  one  se¬ 
quence  that  can  be  overwritten  at  runtime  if  restartable 
atomic  sequences  are  not  supported  on  a  given  system, 
designated  sequences  are  not  portable  between  unipro¬ 
cessors  and  multiprocessors.  The  compiler  must  gener¬ 
ate  a  different  code  sequence  for  each. 

More generally, the use of a designated sequence requires a
strong alliance between the compiler and the operating
system, since changes in the way that one handles atomic
operations must be reflected in the other. The global design
properties of the Taos operating system make this linkage
feasible, however. The crucial property of Taos is that both
the kernel and its multithreaded clients are written in
Modula-2+. In this context, the kernel and the compiler can
cooperate closely to support fast mutual exclusion using
designated inlined sequences. In contrast, for Mach, which
is not intended to be used with any one language and any one
compiler, such a close alliance between the compiler and the
operating system kernel is not feasible.

4  Kernel  design  considerations 

Section 3 described two kernel techniques that support fast
mutual exclusion with restartable atomic sequences. The
implications of these techniques for the inner workings of
the kernel depend both on the exact technique chosen
(explicit registration, or designated sequences) and on the
design details of the specific kernel involved. In this
section we discuss some of these implications.

4.1  Placement  of  the  PC  check 

The most obvious question about kernel structure is: when
should the kernel check/adjust the PC of a suspended thread?
The two points at which the thread can be checked are when
it is first suspended, and when it is about to be resumed.
One could consider intermediate points, but they are less
natural than either point where the kernel already has the
thread in hand.

When  using  designated  sequences,  checking  the  PC 
can  cause  a  page  fault  since  it  involves  reading  arbitrary 
user  memory.  If  the  kernel  path  leading  to  suspension 
is  restricted  in  its  ability  to  incur  additional  faults,  as 
it  is  in  Taos  and  many  other  systems,  early  checking 
of  the  PC  with  designated  sequences  can  be  problem¬ 
atic.  Checking  the  PC  late  solves  this  problem,  since 
there  are  generally  fewer  restrictions  on  kernel  excep¬ 
tions  when  coming  out  of  a  context  switch. 

In Mach, the PC is checked when the thread is suspended
rather than when it returns to user level. Since only the PC
itself, and not the memory it addresses, is inspected, there
is no concern about touching user memory at inopportune
times. The check is done early because the return PC and
reason for entry into the kernel are conveniently available
at that point.

Detection  at  user  level 

Explicit registration and designated sequences place with
the kernel the responsibility for detecting and correcting
atomicity violations. An alternative approach places that
responsibility with the application itself: whenever a
suspended thread is resumed by the kernel, it returns to a
fixed user-level sequence that determines if the thread was
suspended within a restartable atomic sequence. If so, the
user-level recovery code branches to the beginning of the
sequence, otherwise it branches to the suspended
instruction.

User-level detection is attractive because the kernel
provides only the mechanism to ensure atomicity. The policy
lies with the application. Since the kernel is not involved
in either detection or correction, those processes can be
made as rich as necessary to satisfy the atomicity
constraints of any instruction sequence, such as those that
manipulate wait-free data structures [Herlihy 91], as well
as the more conventional Test-And-Set.

The user-level approach is not without problems, however.
Transferring first to a fixed instruction sequence, and then
to the suspended instruction involves more complexity and
overhead than the simple check made by the kernel in either
of the other two strategies. There is a level of control
indirection requiring that the real return address be saved
and restored on the thread’s user-level stack at each
suspension. Because of these problems, and because there is
little motivation to create a clean policy/mechanism
separation when there is only one policy, neither Taos nor
Mach provides for user-level detection.*

4.2  Mutual  exclusion  in  the  kernel 

The kernel is itself a client of thread management
facilities in both Mach and Taos. It is tempting to regard
the kernel’s ability to disable interrupts as a sweeping
solution to the mutual exclusion problem on a uniprocessor.
Mach implicitly adopts this approach as the kernel is
non-preemptive, but is compiled for uniprocessors with all
low-level synchronization operations removed. The Taos
kernel, however, is preemptive, and uses designated
sequences just as applications do. There are two reasons for
this. The first is a minor performance gain, since explicit
disabling and reenabling of interrupts would more than
double the cost of synchronization operations. The second
reason is a desire to use the same Modula-2+ compiler for
all code, whether it be user code or kernel code.

The  use  of  restartable  atomic  sequences  in  both  user 
programs  and  the  kernel  raises  the  question  of  system 
structuring  due  to  potential  recursion.  Two  events,  a 
page  fault  or  a  thread  preemption,  can  trigger  a  thread 
switch  in  the  middle  of  a  restartable  atomic  sequence. 
Since  the  sequence  may  be  in  either  user  or  kernel  code, 
there  are  then  four  events  that  must  be  considered  in 
the  light  of  recursion:  user  page  fault,  user  preemp¬ 
tion,  kernel  page  fault,  and  kernel  preemption.  The 
kernel  uses  mutexes  while  handling  these  events,  so  it 
is  important  to  ensure  that  recursion  does  not  lead  to 
deadlock.  For  example,  a  thread  could  incur  a  user 
page  fault,  be  preempted  while  handling  it  in  the  ker¬ 
nel,  and  upon  resuming  from  the  preemption,  incur  a 
second  page  fault  when  trying  to  do  its  PC  check.  If 
the  preemption  happened  while  holding  a  lock  in  the 
virtual  memory  system,  the  recursion  could  cause  the 
thread  to  deadlock  with  itself. 

The  problem  here  is  that  careless  ordering  of  the 
PC  check  could  lead  to  mutual  recursion  between  the 
thread  scheduler  and  the  virtual  memory  system.  Such 


* At CMU, we rely on user-level restart in a preemptive coroutine
package for Unix systems that is used in teaching an undergraduate
operating systems course. We examine the interrupted
PC within the Unix signal handler, and roll it back if necessary.
With this, we avoid disabling and enabling Unix signals during
every synchronization operation.


problems are avoided in Taos because the system is
structured to impose a strict ordering on the four events
listed above. The handling of any event can cause only
lower-level events. A user page fault can incur kernel
page faults and kernel preemptions, but a kernel preemption
(including the PC check at restart) cannot
incur kernel page faults. Resuming from a user preemption,
by contrast, is allowed to incur page faults.
By consistently ordering the PC checks, Taos is able
to use restartable atomic sequences at all levels of the
system without risk of deadlock or endless recursion.

5  The performance of three software techniques for mutual exclusion

In this section we compare the performance of
restartable atomic sequences, kernel emulation and
software reservation on a RISC-based DECstation
5000/200 running the Mach 3.0 kernel (version MK42)
and CMU’s Unix server (version UX23) [Golub et al.
90]. The DECstation 5000/200 has a 25 MHz MIPS
R3000 processor which does not support atomic
read-modify-write memory accesses in hardware.

We  discuss  performance  at  three  levels.  First,  we 
examine  the  basic  overhead  of  the  various  mechanisms. 
Next,  we  examine  their  effect  on  the  performance  of 
common  thread  management  operations.  Finally,  we 
take  a  system-wide  perspective  and  look  at  the  effect 
that  mutual  exclusion  overhead  has  on  the  performance 
of  several  applications.  In  brief,  we  show  that: 

•  Using restartable atomic sequences instead of
kernel emulation, the performance of multithreaded
applications can be improved substantially.

•  Even single-threaded applications, because they
deal with multithreaded operating system servers,
can benefit indirectly from inexpensive mutual exclusion.

•  Thread suspensions occur much less frequently
than atomic operations, justifying the small
amount of extra work done during thread switch
in order to improve the performance of atomic operations.

•  Restartable atomic sequences are almost never interrupted,
validating the optimistic approach.

Although  we  have  not  collected  detailed  performance 
information  in  Taos,  we  believe  that  the  results  would 
be  similar. 

5.1  Microbenchmarks 

We compare the performance of the three software-based
mutual exclusion mechanisms with a test that
enters a critical section using a Test-And-Set lock,
increments a counter, and leaves the critical section by
clearing the Test-And-Set lock. The test uses only one
thread, so the Test-And-Set always succeeds. Consequently,
we are not measuring the performance of the
thread management system itself (context switching,
scheduling, etc.), but rather that of the basic processor
architecture, memory system and mutual exclusion
mechanism. The update to the counter is included so
as to model a real critical section: interactions between
the atomic operation, the code in the critical section,
and the memory system should be considered when
evaluating a mutual exclusion mechanism. For example,
a scheme requiring several writes will not work well
on a memory system with a write-through cache and a
shallow write-buffer [Bershad et al. 92].

The  elapsed  times  to  execute  the  various  software- 
based  mutual  exclusion  algorithms  are  shown  in  Ta¬ 
ble  1.  The  values  in  the  table  were  determined  by  ex¬ 
ecuting  the  test  in  a  tight  loop  1,000,000  times,  com¬ 
puting  the  average  elapsed  time  of  each  pass  through 
the  loop,  and  subtracting  off  the  loop  overhead.  There 
was  only  negligible  variation  in  times  over  several  runs 
of  the  benchmarks  on  an  unloaded  system. 


Software Mechanism                         Time (μsecs)

Restartable Atomic Sequences (branch)          .64
Restartable Atomic Sequences (inline)          .51
Kernel Emulation                              4.15
Software-reservation (a)                      1.51
Software-reservation (b)                      1.16

Table 1: Microbenchmark results for the DECstation
5000/200.

Restartable  atomic  sequences  were  measured  with 
branches  to  an  explicitly  registered  sequence,  and  also 
with  inlined  code.  The  performance  difference  between 
the  two  approaches  is  due  to  the  subroutine  linkage 
overhead  on  the  MIPS.  Kernel  emulation  and  both 
reservation  schemes  use  out-of-line  calls  to  implement 
the  atomic  operations.  For  these  mechanisms,  the  over¬ 
head  is  sufficiently  high  that  there  is  little  to  be  gained 
by  inlining.  Software-reservation  protocol  (a)  is  an  im¬ 
plementation  of  Lamport’s  fast  mutual  exclusion  al¬ 
gorithm  in  which  each  lock  is  represented  by  a  data 
structure  containing  an  owner  and  a  reservation  field 
(one  word  each),  and  an  array  of  booleans  indexed  by 
a thread identifier. It is the most direct implementation
of the algorithm, but suffers from the high storage
requirements described in Section 2.2. Protocol (b) uses
Lamport’s  algorithm  to  implement  the  “meta”  mutual 
exclusion  function  shown  in  Figure  2.  Protocol  (b),  de¬ 
spite  an  increase  in  the  number  of  memory  accesses  over 
Protocol  (a),  executes  more  quickly  on  the  DECstation 
5000/200  because  of  the  cost  of  having  to  compute  a 
thread’s  unique  identifier  and  the  address  of  its  “busy” 
bit.  With  protocol  (a),  these  must  be  computed  on  en¬ 
try  and  exit  to  a  critical  section,  whereas  with  protocol 


(b),  they  need  only  be  computed  on  entry.  A  dedi¬ 
cated  per-thread  hardware  register  would  reverse  this 
disparity. 

The  table  shows  that  kernel  emulation  is  by  far  the 
most  expensive  approach;  the  trap  and  exception  dis¬ 
patch  in  the  kernel  are  the  main  sources  of  overhead. 
Both software reservation schemes are faster than kernel
emulation, but much slower than restartable atomic
sequences due to the number of instructions and memory
accesses required. Although they outperform kernel
emulation, both reservation strategies have practical
problems that make them difficult to use (see Section 2.2).
Consequently, in the rest of this section, we restrict our
comparisons to systems using only restartable atomic
sequences and kernel emulation.
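
To make the comparison concrete, the times in Table 1 can be expressed relative to the fastest mechanism. The short sketch below does only that arithmetic; the dictionary keys are abbreviated labels chosen for the example.

```python
# Elapsed times in microseconds, copied from Table 1.
table1 = {
    "RAS (branch)": 0.64,
    "RAS (inline)": 0.51,
    "Kernel emulation": 4.15,
    "Software reservation (a)": 1.51,
    "Software reservation (b)": 1.16,
}

# Express every mechanism as a multiple of the inline restartable sequence.
baseline = table1["RAS (inline)"]
relative = {name: round(t / baseline, 1) for name, t in table1.items()}

for name, ratio in relative.items():
    print(f"{name:26s} {ratio:4.1f}x")
```

On these figures kernel emulation costs roughly eight times as much as an inlined restartable atomic sequence, and the two reservation protocols fall between two and three times.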

5.2  Thread  management  overhead 

Mach’s user-level thread management system, C-Threads,
like other thread management packages [Anderson
et al. 89, Bershad et al. 88, Weiser et al. 89],
relies heavily on simple atomic operations to implement
higher level facilities such as threads, locks and condition
variables. We looked at several benchmarks to
understand the influence that atomic operations have
on the performance of these higher level facilities using
two different versions of C-Threads. One version relies
on kernel emulation for synchronization. The other
uses restartable atomic sequences. The benchmarks,
which contain the kinds of operations typically found
in multithreaded programs, are:

•  Spinlock. One thread repeatedly acquires and releases
a spinlock. The spinlock is implemented
with a Test-And-Set sequence.

•  Mutexlock. One thread repeatedly acquires and
releases a relinquishing mutex. Unlike a spinlock,
if a thread tries to acquire a held mutex, it relinquishes
the processor. The mutex is implemented
using a spinlock and a queue of waiting threads.

•  Forktest. Threads are recursively forked in succession;
i.e., thread 1 forks thread 2 which forks
thread 3, etc. After forking, a thread immediately
terminates.

•  Pingpong. Two threads “pingpong” off one another
in a tight loop, using a mutex and condition
variable to execute alternately.
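
The relationship between the spinlock and the relinquishing mutex in the list above can be sketched as follows. This is not the C-Threads implementation: the class and method names are hypothetical, `test_and_set` is simply assumed to be atomic (which is what a restartable atomic sequence or kernel emulation would provide), and handing the mutex directly to the longest-waiting thread on unlock is a design choice made for the example.

```python
from collections import deque


class Spinlock:
    def __init__(self):
        self.held = False

    def test_and_set(self):
        """Set the flag and return its previous value (assumed atomic)."""
        old = self.held
        self.held = True
        return old

    def clear(self):
        self.held = False


class Mutex:
    """A relinquishing mutex built from a spinlock and a queue of waiters."""

    def __init__(self):
        self.guard = Spinlock()  # protects owner and waiters
        self.owner = None
        self.waiters = deque()

    def lock(self, tid):
        """Try to acquire for thread `tid`; False means the caller should
        relinquish the processor until the mutex is handed to it."""
        while self.guard.test_and_set():
            pass  # spin on the guard
        try:
            if self.owner is None:
                self.owner = tid
                return True
            self.waiters.append(tid)
            return False
        finally:
            self.guard.clear()

    def unlock(self):
        """Release the mutex; return the thread it was handed to, if any."""
        while self.guard.test_and_set():
            pass
        self.owner = self.waiters.popleft() if self.waiters else None
        next_owner = self.owner
        self.guard.clear()
        return next_owner
```

Even this small sketch performs one Test-And-Set for every `lock` and `unlock`, which is why the cost of the underlying atomic operation shows up directly in the Mutexlock benchmark.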

The performance of these benchmarks running on a
DECstation 5000/200 is shown in Table 2. Each entry
in the table represents the elapsed time per operation
(i.e., one spinlock acquire and release, one mutex lock
and unlock, one fork and exit, one ping and pong). The
table shows that the performance of thread management
operations depends upon the performance of the
underlying synchronization mechanism. When using
kernel emulation for Test-And-Set, thread management
functions spend the majority of their time in the kernel
executing synchronization code. With restartable
atomic sequences, synchronization overhead becomes
negligible. Even PingPong, with its profligate synchronization
(26 Test-And-Sets per cycle), spends less than
10% of its time synchronizing when using restartable
atomic sequences.


Benchmark      Emulation (μsecs)    R.A.S. (μsecs)

Spinlock              4.3                 .58
MutexLock             4.6                 .91
ForkTest             43.7               23.8
PingPong            230.8              115.2

Table 2: The effect of synchronization on thread management
overhead under Mach 3.0 on a DECstation
5000/200.


5.3  Application  performance 

The  microbenchmarks  and  thread  management  bench¬ 
marks  indicate  that  restartable  atomic  sequences  can 
have  a  large  effect  on  individual  operations.  Ultimately, 
though,  we  are  concerned  with  performance  system- 
wide.  In  this  subsection  we  examine  the  effect  that 
restartable  atomic  sequences  have  on  the  performance 
of  several  applications  running  on  Mach  3.0.  The  ap¬ 
plications  are: 

•  text-format. Format a version of this paper using
LaTeX.

•  afs-bench. A script of file system intensive programs
such as copy, compile and search that
execute out of the Andrew File System
[Satyanarayanan et al. 85].

•  parthenon-n. A resolution-based theorem prover
that uses n threads to exploit or-parallelism [Bose
et al. 89].

•  procon-64. A producer-consumer application in
which one consumer thread coordinates with one
producer thread to read data from a large file into
a 64 byte buffer.

Table  3  shows  the  behavior  of  the  applications  when 
run  under  two  different  versions  of  the  operating  sys¬ 
tem.  The  columns  labeled  “Emul”  reflect  runs  using 
kernel  emulation  for  the  application  and  for  Mach’s 
user-level  Unix  server.  The  columns  labeled  “R.A.S.” 
reflect  runs  using  restartable  atomic  sequences  for  the 
applications  and  for  the  Unix  server.  Each  program 
was  run  several  times  and  the  average  values  for  mea¬ 
surements  taken  during  the  runs  are  given  in  the  table. 

Restartable  atomic  sequences  have  the  greatest  ef¬ 
fect  on  applications  that  use  threads  explicitly,  such  as 
Parthenon  with  1  or  10  threads,  and  procon-64  which 


                Elapsed Time (secs)   Emulation               Thread Suspensions
Program         Emul.     R.A.S.      Traps       Restarts    Emul.      R.A.S.

text-format      10.1       9.8       57305r          0          317
afs-bench       239.4     231.1       2191276        42         8856       9876
parthenon-1      25.8      18.5       1395534         4          412        354
parthenon-10     26.1      18.6       1576714         7          610        499
procon-64        30.4      15.7       2738168         4       106969      91494

Table 3: Effect of synchronization overhead on application performance.


improve by about 30% and 50% respectively. Single-threaded
“vanilla Unix” applications also benefit indirectly
through the improved performance of multithreaded
user-level operating system services. For example,
the performance of the text formatter and the
file system benchmarks, which are themselves single
threaded but rely on the multithreaded Unix server,
improves by about 3%.

The column labeled “Emulation Traps” reflects the
number of synchronizations that occurred when atomic
operations were implemented in the kernel. The column
labeled “Restarts” shows the average number of
atomic sequence restarts that had to be performed
when Test-And-Set was implemented with explicit registration.
The restart count demonstrates that the likelihood
of a thread being suspended during a restartable
atomic sequence is extremely small.

The  last  two  columns  show  the  number  of  times  that 
the  kernel  suspended  a  thread.  For  restartable  atomic 
sequences,  it  indicates  how  many  times  a  thread’s  ex¬ 
ecution  state  had  to  be  checked  to  ensure  that  atomic 
operations  eventually  execute  atomically.  Comparing 
this  column  to  the  number  of  emulation  faults  justifies 
the  small  amount  of  extra  work  required  by  the  restart 
strategies  whenever  a  thread  is  rescheduled.  The  more 
compelling  justification,  of  course,  is  the  reduced  exe¬ 
cution  time  for  the  applications. 
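
The comparison between suspensions and atomic operations can be made explicit with a few divisions over the Table 3 figures. In the sketch below the emulation trap count is used as a proxy for the number of atomic operations, following the text; text-format is omitted because its row is incomplete, and the variable names are chosen for the example.

```python
# (emulation traps, restarts, thread suspensions with R.A.S.) from Table 3.
table3 = {
    "afs-bench":    (2191276, 42, 9876),
    "parthenon-1":  (1395534, 4, 354),
    "parthenon-10": (1576714, 7, 499),
    "procon-64":    (2738168, 4, 91494),
}

ops_per_suspension = {}       # atomic operations per PC check
restarts_per_suspension = {}  # fraction of suspensions that forced a restart

for name, (traps, restarts, suspensions) in table3.items():
    ops_per_suspension[name] = traps / suspensions
    restarts_per_suspension[name] = restarts / suspensions

for name in table3:
    print(f"{name:13s} {ops_per_suspension[name]:8.1f} ops/suspension, "
          f"{100 * restarts_per_suspension[name]:.3f}% restarted")
```

Even for procon-64, which suspends most often, there are roughly thirty atomic operations for every suspension, and in every program under 2% of suspensions required a restart.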

The number of emulation traps can be used
to account for the performance difference between
the two versions of the system. For example,
parthenon-10, with its 1.57 million kernel emulations,
should improve by about 1.57 million × 3.7 μsecs
(4.3 μsecs − .58 μsecs), or about 5.8 seconds. The actual
improvement is slightly greater than this for two reasons.
First, the correlation between elapsed time and
number of emulation traps is neither strictly negative
nor strictly positive. Hence, the number of emulation
traps is a good, but not exact, predictor of performance
improvement. Second, some of the improvement
is due to the reduction in scheduling overhead
that comes with a decrease in critical section service
time.
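
The same estimate can be repeated for the other programs in Table 3. The sketch below uses the unrounded trap counts and the Spinlock row of Table 2 for the per-operation saving, so its parthenon-10 figure differs slightly from the rounded 5.8 seconds quoted above; text-format is omitted because its trap count is not legible, and the names are chosen for the example.

```python
# Saving per emulated operation, from the Spinlock row of Table 2 (microseconds).
SAVING_PER_TRAP_US = 4.3 - 0.58

# (elapsed secs with emulation, elapsed secs with R.A.S., emulation traps)
table3 = {
    "afs-bench":    (239.4, 231.1, 2191276),
    "parthenon-1":  (25.8, 18.5, 1395534),
    "parthenon-10": (26.1, 18.6, 1576714),
    "procon-64":    (30.4, 15.7, 2738168),
}

predicted = {}  # improvement explained by cheaper atomic operations (secs)
actual = {}     # measured improvement (secs)

for name, (emul, ras, traps) in table3.items():
    predicted[name] = traps * SAVING_PER_TRAP_US * 1e-6
    actual[name] = emul - ras

for name in table3:
    print(f"{name:13s} predicted {predicted[name]:5.2f} s, "
          f"actual {actual[name]:5.2f} s")
```

In each case the predicted saving is smaller than the measured one, which is consistent with the two reasons given in the text.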

For even very short critical sections (10 to 20 instructions)
restartable atomic sequences add little extra
overhead, and much of that overhead comes before
the critical section has actually been entered. Consequently,
a short critical section remains short, and the
likelihood of the critical section itself being suspended is
small. With kernel emulation, though, each Test-And-Set
takes about 100 instructions, and nearly all are executed
with processor interrupts disabled. When control
returns out of the kernel, interrupts are reenabled
and any pending interrupts are delivered. If the delivered
interrupt causes a preemption, then the thread
that just performed the atomic operation will be descheduled
and another thread will run. If that thread
attempts to enter the same critical section, it will find
the Test-And-Set variable already set and will relinquish
its processor to the scheduler.

We  looked  more  closely  at  parthenon-10  to  determine 
the  influence  of  inflated  critical  sections  on  program 
behavior.  The  program  synchronizes  often,  but  most 
synchronization  operations  guard  short  critical  sections 
that  simply  increment  a  counter,  or  dequeue  an  item 
from  a  linked  list.  In  running  the  program,  we  counted 
the  number  of  times  that  a  thread  was  unable  to  en¬ 
ter  a  critical  section  because  of  a  lock  held  by  another 
(suspended)  thread.  When  using  kernel  emulation  in 
parthenon-10,  a  thread  found  a  Test-And-Set  lock  held 
about  twice  as  often  as  with  restartable  atomic  se¬ 
quences. 

6  Software vs. hardware support for mutual exclusion

The lack of hardware support for atomic operations offered
the initial motivation to investigate efficient software
solutions [Anderson et al. 91]. Most processors,
however, do support some type of atomic read-modify-write
instruction. In this section we evaluate the use of
restartable atomic sequences on such processors.

We  measured  the  overhead  to  acquire  and  release 
a  Test-And-Set  lock  using  memory-interlocked  instruc¬ 
tions  and  restartable  atomic  sequences  on  eight  proces¬ 
sor  architectures.  The  results  are  shown  in  Table  4. 
For  the  interlocked  cases,  the  times  do  not  include  any 
linkage  overhead,  as  the  Test-And-Set  and  subsequent 
release  instructions  can  be  executed  inline.  In  the  cases 
of  explicit  registration,  linkage  overhead  is  included  for 
the  Test-And-Set,  but  not  for  the  release,  which  can 
be  inlined.  The  fourth  column  of  Table  4  shows  the 
call  linkage  overhead.  Even  with  the  linkage  overhead, 
restartable  atomic  sequences  are  more  efficient  than 
memory-interlocked  instructions  on  the  DEC  CVAX, 


                      Interlocked    Explicit       Linkage      Designated
                      Instruction    Registration   Overhead     Sequence
Processor             (μsecs)        (μsecs)        (μsecs)      (μsecs)

DEC CVAX                2.5            2.2            .6           1.6
Motorola 68030          1.1            2.0            .8           1.2
Intel 386               1.0            1.6            .7            .9
Intel 486                .7             .6            .3            .3
Intel 860                .3             .4            .2            .2
Motorola 88000           .9             .3            .1            .2
Sun SPARC                .8            1.0            .3            .7
HP 9000 Series 700       .94            .17           .08           .09

Table 4: Hardware and software overheads of Test-And-Set using different implementation strategies.


the Intel 486, the Motorola 88000, and the Hewlett
Packard 9000 (PA-RISC) Series 700.

Using designated sequences, the software approach
outperforms the hardware in all cases (subtract the
overhead of linkage from that of an explicitly registered
sequence). As processor speeds increase relative to bus
and memory speeds, we expect the optimistic software
solution to continue its dominance. For interlocked
instructions to outperform optimistic software techniques
on uniprocessors, they must be implemented so that
they exploit the simpler single processor case.
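
The parenthetical rule above (designated sequence cost = explicit registration cost minus linkage overhead) can be checked against the Table 4 figures with a short sketch. The row label "DEC CVAX" is inferred from the surrounding text, and the variable names are chosen for the example.

```python
# (explicit registration, linkage overhead, designated sequence), in
# microseconds, from Table 4. The first row's label is inferred from the text.
table4 = {
    "DEC CVAX":           (2.2, 0.6, 1.6),
    "Motorola 68030":     (2.0, 0.8, 1.2),
    "Intel 386":          (1.6, 0.7, 0.9),
    "Intel 486":          (0.6, 0.3, 0.3),
    "Intel 860":          (0.4, 0.2, 0.2),
    "Motorola 88000":     (0.3, 0.1, 0.2),
    "Sun SPARC":          (1.0, 0.3, 0.7),
    "HP 9000 Series 700": (0.17, 0.08, 0.09),
}

# Subtract the linkage overhead from the explicitly registered time.
derived = {
    name: round(explicit - linkage, 2)
    for name, (explicit, linkage, _) in table4.items()
}

for name, (_, _, designated) in table4.items():
    print(f"{name:20s} derived {derived[name]:.2f}, table {designated:.2f}")
```

For all eight processors the derived value matches the Designated Sequence column, so that column adds no independent measurement beyond the other two.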

The  table  demonstrates  that  one  should  not  neces¬ 
sarily  rely  on  an  architecture  and  memory  system  to 
provide  functions  that  may  be  provided  more  cheaply 
with  a  combination  of  operating  system,  compiler,  and 
runtime  support. 

7  Related  work 

The  Trellis/Owl  object-oriented  language  [Moss  & 
Kohler  87]  used  optimistic  synchronization  techniques 
similar  to  those  described  in  this  paper.  The  Owl 
runtime  system  provided  concurrency  among  several 
threads  sharing  a  single  VMS  process,  and  used  soft¬ 
ware  interrupts  from  VMS  to  drive  its  multiplexing.  It 
provided  atomicity  for  its  own  needs  and  those  of  user 
programs  by  backing  out  of  certain  registered  runtime 
routines,  and  by  emulating  forward  through  designated 
sequences.  The  most  important  difference  between  Owl 
and  the  work  described  in  this  paper  is  our  integration 
of  restartable  atomic  sequences  with  the  operating  sys¬ 
tem  kernel. 

User-level detection and restart is similar to the approach
taken in [Anderson et al. 92] to support user-level
thread management on shared memory multiprocessors.
In that system, when a thread is preempted
inside a critical section, it is immediately resumed not
where it left off, but within code that gives the thread
management system the opportunity to recover from
the preemption. This machinery is sufficient for implementing
restartable atomic sequences on a uniprocessor.

The Intel i860 processor [Intel860 89] provides hardware
support for restartable sequences. A thread begins
a multi-instruction atomic sequence with a special
instruction that sets a bit in the processor status word,
disables interrupts, and locks the bus. The bit is cleared
and the bus lock is automatically released on the next
write to memory, after 32 cycles, or on a processor exception.
The release on write covers the common case
of a successful read-modify-write sequence. The kernel
must check the bit on every transfer from the kernel
to user level. If the bit is set, the kernel must back
the thread up to the special instruction. Despite the
i860’s hardware support for restartable sequences (the
bit in the processor status word eliminates the need
to perform explicit registration or instruction stream
inspection after every context switch), it offers little
performance advantage over software techniques on a
uniprocessor (see Table 4).


8  Conclusions 

Restartable atomic sequences represent a “common
case” approach to mutual exclusion on a uniprocessor.
In the common case, an atomic operation runs uninterrupted.
The uncommon case can be detected after
it occurs and can be handled by means of a simple recovery
process. As such, restartable atomic sequences
are appropriate for uniprocessors that do not support
memory-interlocked atomic instructions. Moreover, on
processors that do have hardware support for synchronization,
better performance may be possible with
restartable atomic sequences.